feat(gemma): eager Q5_K packed path + Kotlin/Native board load path#176

Merged
michalharakal merged 6 commits into develop from feature/gemma-q5k-eager
Jun 14, 2026

@michalharakal

Wires the new SKaiNET Q5_K packed kernel into the eager Gemma runtime and adds the Kotlin/Native (board) weight-load path, so FunctionGemma-270M (Q5_K_M) runs eager with KV-cache + in-kernel Q5_K dequant — no FP32 inflation.

Depends on SKaiNET-developers/SKaiNET#734 (the Q5_K kernel + K/N cinterop). Until that's released, this consumes a locally-published sk.ainet.core:*:0.29.1 via mavenLocal() (added here).

What's here

  • Eager Q5_K (JVM): GemmaMemSegConverter keeps Q5_K weights packed (Q5_KBlockTensorData, 176 B/block) instead of dequantizing to FP32 — runs the in-kernel dequant matmul.
  • commonMain board path: GemmaQuantLayout.kt (relayoutKSeriesRowMajorToBlockMajor + logicalShapeFor + packGemmaKQuant) and GemmaPackedWeights.kt (convertGemmaWeightsPacked + extractRawBytes), the K/N analogue of the jvmMain MemSeg converter (no java.lang.foreign). Wired into GemmaNetworkLoader.load(NATIVE_OPTIMIZED).
  • Bump skainet 0.28.1 → 0.29.1; mavenLocal() first in settings (Central fallback).
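
The 176 B/block figure follows from the Q5_K super-block layout used by GGUF (256 weights per block). A small arithmetic sketch of the footprint difference against FP32; the helper names are illustrative and not part of this PR:

```kotlin
// Q5_K super-block fields as laid out in GGUF/ggml (256 weights per block).
const val WEIGHTS_PER_BLOCK = 256
const val D_BYTES = 2                        // fp16 super-block scale
const val DMIN_BYTES = 2                     // fp16 super-block min
const val SCALES_BYTES = 12                  // packed 6-bit sub-block scales and mins
const val QH_BYTES = WEIGHTS_PER_BLOCK / 8   // one high bit per weight
const val QS_BYTES = WEIGHTS_PER_BLOCK / 2   // low four bits per weight
const val Q5K_BLOCK_BYTES = D_BYTES + DMIN_BYTES + SCALES_BYTES + QH_BYTES + QS_BYTES

// Bytes needed to hold `elements` weights as packed Q5_K blocks (illustrative helper).
fun q5kPackedBytes(elements: Long): Long {
    require(elements % WEIGHTS_PER_BLOCK == 0L) {
        "element count must be a multiple of $WEIGHTS_PER_BLOCK"
    }
    return elements / WEIGHTS_PER_BLOCK * Q5K_BLOCK_BYTES
}

// Bytes needed for the same weights after dequantizing to FP32.
fun fp32Bytes(elements: Long): Long = elements * 4
```

For a hypothetical 1024 x 1024 weight matrix this gives 720,896 bytes packed versus 4,194,304 bytes as FP32, i.e. 5.5 bits per weight instead of 32.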

Verification

  • GemmaQuantLayoutTest: relayout + packing + byte-extraction round-trip green on JVM and linuxX64 (native byte extraction executes, not just compiles).
  • GemmaQ5KPackedParityTest: FP32 baseline, jvmMain MemSeg-packed, and the wired load(NATIVE_OPTIMIZED) path all decode FunctionGemma to the identical token sequence → <tool_0>(state="on")<end> for "Turn the light on."
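
The parity check amounts to decoding the same prompt through every load path and requiring identical token ids. A minimal sketch of that comparison, assuming a hypothetical `DecodePath` abstraction; the real GemmaQ5KPackedParityTest drives the Gemma runtime and may be structured differently:

```kotlin
// Hypothetical stand-in for "decode this prompt through one load path".
fun interface DecodePath {
    fun decode(prompt: String): List<Int>
}

// Index of the first differing token, or null when both sequences are identical.
fun firstDivergence(expected: List<Int>, actual: List<Int>): Int? {
    val common = minOf(expected.size, actual.size)
    for (i in 0 until common) {
        if (expected[i] != actual[i]) return i
    }
    // Same prefix: identical only if the lengths match too.
    return if (expected.size == actual.size) null else common
}

// Decodes the prompt through every path and fails on the first mismatch.
// The first map entry acts as the baseline (e.g. the FP32 path).
fun assertParity(prompt: String, paths: Map<String, DecodePath>): List<Int> {
    require(paths.isNotEmpty()) { "need at least one decode path" }
    val results = paths.mapValues { (_, path) -> path.decode(prompt) }
    val (baselineName, baseline) = results.entries.first()
    for ((name, tokens) in results) {
        val at = firstDivergence(baseline, tokens)
        check(at == null) { "$name diverges from $baselineName at token index $at" }
    }
    return baseline
}
```

Called with three entries (FP32 baseline, MemSeg-packed, `load(NATIVE_OPTIMIZED)`), it returns the shared token sequence, or reports the first token index at which a path diverges.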

Remaining (board)

Full on-device FunctionGemma decode on the SL2610 (build the gemma stack for linuxArm64, run on device) + benchmark vs the IREE path.

🤖 Generated with Claude Code

michalharakal and others added 6 commits June 10, 2026 23:41
FunctionGemma-270M ships as Q5_K_M, but GemmaMemSegConverter dequantized
Q5_K weights to FP32 on load ("no native matmul kernel yet for Q5_K"),
losing the memory savings and the in-kernel dequant. Upstream SKaiNET
0.29.1 now provides a first-class Q5_K packed matmul (Q5_KBlockTensorData
+ Q5KMatmulKernel: scalar/Panama/native), so keep Q5_K packed here too:
relayout GGUF bytes to block-major + wrap as Q5_KBlockTensorData
(176 B/block). Dispatch + lazy transpose reach it via DefaultCpuOps.

- Bump skainet 0.28.1 -> 0.29.1 (source-of-truth for the llm-bom platform).
- settings.gradle.kts: mavenLocal first so a locally-published SKaiNET
  0.29.1 (carrying the in-progress Q5_K kernel) shadows Maven Central until
  it's released; Central remains the fallback.
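
A sketch of the repository ordering described above, assuming the repositories are declared in a `dependencyResolutionManagement` block; the actual settings.gradle.kts is not shown here and may declare more:

```kotlin
// settings.gradle.kts (sketch, not the repository's actual file)
dependencyResolutionManagement {
    repositories {
        mavenLocal()    // consulted first: a locally published SKaiNET 0.29.1 wins
        mavenCentral()  // fallback for every artifact not found locally
    }
}
```

Gradle consults repositories in declaration order, so the local artifact shadows Central only while it exists in the local Maven repository.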

Verified (GemmaQ5KPackedParityTest, -PincludeIntegration): the Q5_K packed
path decodes FunctionGemma byte-identically to the FP32 baseline —
[262146, 236769, 3255, 718, 498, 1373, 262152, 106] ->
`<tool_0>(state="on")<end>` for "Turn the light on." (the known-good tool call),
0.81 tok/s on the JVM host incl. prefill.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ard path

The board binary is Kotlin/Native, but GemmaMemSegConverter (the NATIVE_OPTIMIZED
packed-weight path) is jvmMain-only (java.lang.foreign). Move the reusable,
platform-neutral pieces to commonMain so K/N can keep K-quant weights packed:

- GemmaQuantLayout.kt (commonMain): logicalShapeFor +
  relayoutKSeriesRowMajorToBlockMajor (now copyInto, KMP-safe) +
  packGemmaKQuant<T>(), which builds heap-packed
  Q4_K/Q5_K/Q6_KBlockTensorData directly (no MemSeg/Arena).
- GemmaMemSegConverter (jvmMain) now shares those commonMain helpers (dup
  removed); MemSeg/FFM conversion + FP32 fallbacks stay JVM-only.
- commonTest GemmaQuantLayoutTest: block-transpose relayout + packing, runs on
  every target.

Verified: gemma compiles for JVM + linuxX64; layout tests green (3).
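
For illustration, a block-transpose relayout built on `copyInto` could look like the sketch below. The layout semantics are an assumption (input: each row's blocks stored consecutively; output: blocks grouped by block index across rows), and the function name and signature are hypothetical, not those of relayoutKSeriesRowMajorToBlockMajor:

```kotlin
// Hypothetical sketch of a block-transpose relayout using only common stdlib calls.
// Assumed input:  block (row r, index b) lives at block slot r * blocksPerRow + b.
// Assumed output: block (row r, index b) lives at block slot b * rows + r.
fun relayoutRowMajorToBlockMajor(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int, // 176 for Q5_K
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) {
        "expected ${rows * blocksPerRow * blockBytes} bytes, got ${src.size}"
    }
    val dst = ByteArray(src.size)
    for (r in 0 until rows) {
        for (b in 0 until blocksPerRow) {
            val srcOff = (r * blocksPerRow + b) * blockBytes
            val dstOff = (b * rows + r) * blockBytes
            // copyInto is available on every Kotlin Multiplatform target.
            src.copyInto(dst, destinationOffset = dstOff, startIndex = srcOff, endIndex = srcOff + blockBytes)
        }
    }
    return dst
}
```

Under these assumptions the relayout is a pure permutation of whole blocks, so applying it again with `rows` and `blocksPerRow` swapped restores the original bytes, which is a convenient round-trip property to test.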

Next (board integration): a commonMain convertGemmaWeightsPacked wired into the
K/N load path (byte extraction differs: JVM IntArrayTensorData vs native
Byte-backed), then a full K/N decode on the SL2610.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oad()

NATIVE_OPTIMIZED loads produce raw-byte quant tensors the network mapper can't
consume; on JVM an external convertGemmaWeightsToMemSeg (FFM) handled that, but
the Kotlin/Native board has no such path. Add a commonMain converter and make
load() apply it, so load(NATIVE_OPTIMIZED) yields a runnable network on the
board AND the JVM (previously it couldn't be built from raw-byte weights at all).

- GemmaPackedWeights.kt (commonMain): convertGemmaWeightsPacked — packs
  Q4/5/6_K matmul weights to heap Q*_KBlockTensorData (packGemmaKQuant),
  dequants token_embd/output to FP32 (gathered, no transpose) and other quant
  types to FP32 [out,in]. No java.lang.foreign. Plus extractRawBytes, which
  reads the loader's bytes back across both backings (JVM IntArrayTensorData /
  native Byte-typed).
- GemmaNetworkLoader.load(): for NATIVE_OPTIMIZED, run convertGemmaWeightsPacked
  before applyWeightsToNetwork.

Verified on JVM AND linuxX64 (GemmaQuantLayoutTest, 4 tests each): relayout,
packing, and the byte-extraction round-trip — so native byte extraction is
executed, not just compiled.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends GemmaQ5KPackedParityTest to also decode via
GemmaNetworkLoader.load(NATIVE_OPTIMIZED) — the wired commonMain
convertGemmaWeightsPacked (board) path, no MemSeg/Arena. All three paths
(FP32 baseline, jvmMain MemSeg-packed, load() packed) produce the identical
token sequence -> `<tool_0>(state="on")<end>` for "Turn the light on."

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Six real-model integration tests (RealGemmaLoad/Eager/BakeIrpa/ExternalParam/
DequantDump + GemmaBehavioralAb) pointed at an old workspace path
(/home/miso/projects/coral/sl2610-voice-cc-kt/models/...) and failed with
"File not found" under -PincludeIntegration. Repoint them to the actual model
location (SKaiNET-embedded/sl2610-function-calling/models/), matching
GemmaQ5KPackedParityTest.

Verified: all 6 pass against skainet 0.30.0 (mavenLocal), -PincludeIntegration.
@michalharakal michalharakal merged commit 0406dc6 into develop Jun 14, 2026
0 of 2 checks passed